Fast UI Draw
UI Rendering for GPUs only
Benchmark: painter-cells
A. Draws a table of cells; each cell consists of

1. a background,

2. a rotating and moving image,

3. a rotating and moving line of text.
B. In addition, strokes the boundary (with anti-aliasing) between cells
C. Modes

1. Can set the table to rotate as a whole

2. Can set cells to rotate separately

3. Options on zooming, what to draw, etc. can be changed while running (or at startup) too
D. Ported to Qt, Cairo and SKIA

1. Check out branch with_ports_of_painter_cells
Performance numbers for ports of the painter-cells demo to
different libraries, running 25x12 (600) cells. Normalized
to the Cairo CPU backend.

Canvas API      Performance
Cairo CPU        1.0
Cairo GL         0.27
Cairo Xlib       1.60
Qt Raster        1.01
Qt GL            0.76
Qt Native        0.29
SKIA GL          1.91
Fast UI Draw     9.33
Performance numbers for ports of the painter-cells demo to
different libraries, running 50x25 (1250) cells. Normalized
to the Cairo CPU backend.

Canvas API      Performance
Cairo CPU        1.0
Cairo GL         0.15
Cairo Xlib       1.25
Qt Raster        1.50
Qt GL            0.45
Qt Native        0.14
SKIA GL          1.36
Fast UI Draw    11.11
CONCLUSION:

The more complex the scene, the better
FastUIDraw performs relative to the other
renderers.


OpenSource 


REASON:
Fast UI Draw aims to leverage GPUs to render UIs.
NOT SIMPLE! GPUs are built for throughput: they want
bigger jobs with lots of the same work repeated. In contrast,
CPUs are better able to handle a changing workload or state.




GPU and CPU are very different beasts

CPU
- No fixed function units.
- Very few rendering threads, each thread narrow.
  A. Switching threads within a (virtual) core is a very expensive operation (Hyper-threading gives 2 virtual cores per physical core).
  B. Even with SIMD instructions, a thread is only 4-wide or 8-wide:
     1. SSE: 4-wide
     2. AVX: 8-wide

GPU
- Fixed function units:
  - Samplers
  - Rasterizers
  - Triangle setup and clipping
- Many times more threads than cores, each thread quite wide.
  A. A GPU core has register space for many threads; switching threads is very cheap.
  B. GPU threads are quite wide.
  C. Example: Intel HD 4000 (IVB) has 16 EUs, and each EU has 7 threads. Thread width is 8 or 16 (depending on the shader).
  D. GPUs hide latency with threads (and caching).
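To make the width difference concrete, the arithmetic below multiplies out the Intel HD 4000 figures quoted above. The CPU side is an assumption made only for comparison (a hypothetical 4-core part with Hyper-threading and AVX); it is not a figure from this talk.

```python
# Figures quoted above for Intel HD 4000 (IVB).
EXECUTION_UNITS = 16
THREADS_PER_EU = 7

gpu_threads = EXECUTION_UNITS * THREADS_PER_EU  # hardware threads resident on the GPU
gpu_lanes_simd8 = gpu_threads * 8               # lanes when the shader runs 8-wide
gpu_lanes_simd16 = gpu_threads * 16             # lanes when the shader runs 16-wide

# ASSUMPTION (not from the talk): a 4-core CPU, 2 virtual cores each, AVX 8-wide.
CPU_CORES = 4
cpu_threads = CPU_CORES * 2
cpu_lanes_avx = cpu_threads * 8

print(gpu_threads, gpu_lanes_simd8, gpu_lanes_simd16)
print(cpu_threads, cpu_lanes_avx)
```

Under these assumptions the GPU keeps over an order of magnitude more lanes in flight, which is why it wants large, uniform batches of work.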
GPU has fixed function parts

- Fixed function parts have a performance and power advantage over programmable parts

- For GPUs we have
  - Samplers
  - Rasterizer
  - Triangle setup
  - Triangle clipping
  - Depth buffer (usually with additional tricks to enhance performance)

Should leverage these as much as possible!
Goal: Minimize State Thrashing

If there is not enough work between state changes, GPU efficiency suffers.

Examples of state changes:

- Changing texture source

- Changing buffer sources or format

- Changing shader

- Changing depth/stencil test/operation

- Changing values of uniforms in shaders

The challenge for UI renderers is to keep the same API state even when changing what, how and
where to draw
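A minimal sketch of why this matters for a UI scene, assuming a simplified model in which each run of consecutive draws sharing the same state becomes one API draw call. The shader and texture names are hypothetical. Because painter's order must be preserved, the draws cannot simply be sorted by state.

```python
from itertools import groupby

def count_batches(draws):
    """One API draw call per run of consecutive draws that share a state."""
    return sum(1 for _ in groupby(draws, key=lambda d: d["state"]))

# Four cells, each drawn as background, image, text (painter's order).
# Hypothetical per-item state: (shader, texture).
scene = [
    {"item": "background", "state": ("solid_shader", None)},
    {"item": "image", "state": ("image_shader", "cell_texture")},
    {"item": "text", "state": ("glyph_shader", "glyph_texture")},
] * 4

per_item_state = count_batches(scene)

# Same scene when every item can be drawn with one shared state.
unified = [{"item": d["item"], "state": ("uber_shader", "atlases")} for d in scene]
single_state = count_batches(unified)

print(per_item_state, single_state)  # prints "12 1"
```

With per-item state, every draw forces a state change; with a single shared state, the whole scene collapses into one batch.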
Fast UI Draw aims to have a single pipeline state

Standard canvas features:
- 3x3 transformation
- Brush (image, gradients, brush transformation)
- Stroking paths (with and without anti-aliasing)
- Filling paths
- ClipIn
- ClipOut
- Porter-Duff blend modes

We also have features that are beyond canvas:
- Custom shader support (for example, non-linear distortions without needing to resort to render to texture first)
- Stroke width can also be specified in pixels, even for non-orthogonal transformations
Blend Modes
1. I say "aims to have". Currently, FastUIDraw needs 3 pipeline states for
the GL backend.

A. The states exist to support the 12 Porter-Duff blend modes (all 12 are
covered by just 3 different pipeline states). The only
difference between the states is the blend mode.

B. Later, when support for W3C compositing modes is added, for some
hardware, we can reduce the number of states to just ONE.
An API trace of Fast UI Draw rendering a frame is boring.
The contents of an ENTIRE frame are:

1. Map buffer, unmap buffer
   A. Repeat a few times if there is a lot to draw
2. Set GL API state (bind textures, use program, etc.)
3. Repeat a few times (same number as in 1A)
   A. Bind VAO, bind TBO, then repeat a few times:
      1. Set blend state (only happens for changes involving half of the Porter-Duff blend modes)
      2. glMultiDrawElements (or glDrawElements)
Fast UI Draw uses an Uber-Shader and Data Store

1. Painter draw calls are always just: add data to mapped buffers

   A. Essentially 3 buffers: attribute buffer, index buffer and data store buffer

   - The data store buffer stores values that are shared between triangles (such as transformation matrix, brush
     properties)

   - Attribute and index values are copied from the "what" (aka PainterAttributeData) directly.

2. When Painter draws are done, then 3D API draw calls are issued
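The flow above can be sketched as follows. This is an illustrative model only, not FastUIDraw's actual classes: `SketchPainter` and its methods are hypothetical names, Python lists stand in for mapped GPU buffers, and tagging each vertex with its data store offset is an assumption about how a shader would locate the shared values.

```python
class SketchPainter:
    """Hypothetical model: draw() only appends to buffers, end() issues API draws."""

    def __init__(self):
        self.attributes = []   # per-vertex values
        self.indices = []      # triangle indices into self.attributes
        self.data_store = []   # values shared between triangles
        self.api_draw_calls = 0

    def draw(self, what_attributes, what_indices, how_values):
        # The "how" lands on the data store; remember where it starts.
        data_offset = len(self.data_store)
        self.data_store.extend(how_values)
        # The "what" is copied as-is; each vertex is tagged with the offset
        # of its shared data, and indices are rebased to the buffer position.
        base = len(self.attributes)
        self.attributes.extend((attr, data_offset) for attr in what_attributes)
        self.indices.extend(base + i for i in what_indices)

    def end(self):
        # Only now would the 3D API be asked to draw anything.
        if self.indices:
            self.api_draw_calls += 1
        return self.api_draw_calls


quad = [(0, 0), (1, 0), (1, 1), (0, 1)]
quad_indices = [0, 1, 2, 0, 2, 3]
identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
scaled = [2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0]

painter = SketchPainter()
painter.draw(quad, quad_indices, identity)
painter.draw(quad, quad_indices, scaled)
calls = painter.end()
```

Two Painter draws with different transformations still end in a single API draw call, because the transformation is data in the data store rather than API state.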
High Level Architecture

Implementing a backend also includes implementing storage
classes to hold the image, glyph and other data
that the backend's implementation of
PainterBackend uses.
Data Store Buffer

The data store buffer is a buffer into which FastUIDraw packs data that a shader will unpack. Each call
to a draw method of PainterPacker packs a header onto the data store buffer. The header
contains what sub-shaders to use and offsets for various data:

- Unnormalized depth value
- Brush, Blend and Item sub-shader IDs
- Offset for 3x3 transformation matrix
- Offset for 4 clipping equations
- Offset for brush data and item shader data

How data is packed is tightly specified so that creating a
backend for other 3D APIs (for example Vulkan) is possible.

FastUIDraw exposes an interface so that data already
packed onto the data store buffer can be reused by an
application.
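As a rough illustration of such a header, the sketch below packs the listed fields as eight 32-bit unsigned integers and unpacks them again. The field order, field names and integer widths are assumptions chosen for the example; they are not FastUIDraw's actual packing format.

```python
import struct

# ASSUMED layout (hypothetical): eight little-endian 32-bit unsigned integers.
HEADER_FORMAT = "<8I"
HEADER_FIELDS = (
    "depth",              # unnormalized depth value
    "brush_shader",       # brush sub-shader ID
    "blend_shader",       # blend sub-shader ID
    "item_shader",        # item sub-shader ID
    "transform_offset",   # where the 3x3 matrix lives in the data store
    "clip_offset",        # where the 4 clipping equations live
    "brush_data_offset",  # where the brush data lives
    "item_data_offset",   # where the item shader data lives
)

def pack_header(**fields):
    """Pack the header fields into bytes, in the order of HEADER_FIELDS."""
    return struct.pack(HEADER_FORMAT, *(fields[name] for name in HEADER_FIELDS))

def unpack_header(blob, offset=0):
    """What a shader would do conceptually: read the header back at an offset."""
    values = struct.unpack_from(HEADER_FORMAT, blob, offset)
    return dict(zip(HEADER_FIELDS, values))

header = dict(depth=7, brush_shader=1, blend_shader=0, item_shader=3,
              transform_offset=32, clip_offset=68, brush_data_offset=84,
              item_data_offset=120)
blob = pack_header(**header)
round_trip = unpack_header(blob)
```

Because a header holds only offsets, two draws that share a transformation or brush can point at the same already-packed data instead of packing it twice.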
Goal: Minimize Re-computation of Data

FastUIDraw distinguishes strongly between "what" and "how" to draw.
1. The "what" to draw is computed once and then reused repeatedly.
2. The "how" to draw is small amounts of data fed to shaders. The "how" to draw
includes:
a. Brush (which image, which color stop sequence, gradient location, etc.)
b. Position transformation
c. Clipping
d. Stroking parameters (stroke width, miter limit, dash pattern)
3. The Painter interface is organized so that it is natural to reuse the data of "what" to
draw.
4. The "how" to draw is data that lands on the data store buffer.
Generate Data to Minimize Work to Draw Repeatedly

Example: path stroking data
- The data created from a path for stroking is built so that the same data can be copied directly to
the attribute and index buffers even when the stroke style changes:
  - Stroking width
  - Miter limit
  - Dash pattern (Not yet completed!)

Example: path fill data
- Paths are triangulated and broken into components keyed by fill rule.
- This allows arbitrary fill rules when filling paths (or clipping by filled paths), and CPU usage is
minimal after the first time a path is used for filling.
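One way to picture the fill data is sketched below. It assumes the triangulated components are keyed by winding number and that a fill rule is simply a predicate on the winding number; that keying is an assumption made for illustration, and the names are hypothetical. Changing the fill rule then only changes which precomputed index lists are selected.

```python
def select_indices(components, fill_rule):
    """components maps winding number -> triangle indices computed once per path.
    fill_rule is a predicate on the winding number."""
    selected = []
    for winding in sorted(components):
        if fill_rule(winding):
            selected.extend(components[winding])
    return selected

def odd_even(winding):
    return winding % 2 != 0

def nonzero(winding):
    return winding != 0

def complement_nonzero(winding):
    return winding == 0

# Hypothetical triangulation result for one path, computed once.
components = {
    0: [0, 1, 2],
    1: [3, 4, 5],
    2: [6, 7, 8],
    -1: [9, 10, 11],
}

odd_even_fill = select_indices(components, odd_even)
nonzero_fill = select_indices(components, nonzero)
complement_fill = select_indices(components, complement_nonzero)
```

No re-triangulation happens when the rule changes, which matches the claim that CPU usage is minimal after the first fill.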
Issue: Many images in a UI scene

1. The idea of an image in a 3D API is embodied by the idea of a texture

2. Changing what texture (or textures) to use changes a portion of the pipeline state (namely
the binding tables)

3. UI scenes contain many images of various sizes
Solution: Image Atlas for Padded Tiles

Images are broken into tiles of the same size (the tile size is configurable but constant for
the lifetime of the program).

Deallocating and allocating images is now trivial since we just need to know if enough tiles
are free.

An image also has an index tile. The index tile gives the locations in the color tile texture
to use.

The values of an index tile can instead refer to another index tile. This allows us to
support huge images (beyond the hardware maximum texture size) without issue (subject to
memory availability).

Padding the color tiles allows us to use the GPU sampler to perform bilinear
filtering and also to leverage it to accelerate bicubic filtering.
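A minimal sketch of the tile bookkeeping described above, under simplifying assumptions: one level of index tile, no padding, and a plain free list. `TileAtlas` and its methods are hypothetical names, not FastUIDraw's API.

```python
import math

class TileAtlas:
    """Hypothetical fixed-size tile allocator with a free list."""

    def __init__(self, tiles_x, tiles_y, tile_size):
        self.tile_size = tile_size
        self.free = [(x, y) for y in range(tiles_y) for x in range(tiles_x)]

    def allocate(self, width, height):
        """Return an index mapping image tile coordinates to atlas tiles,
        or None when not enough tiles are free."""
        nx = math.ceil(width / self.tile_size)
        ny = math.ceil(height / self.tile_size)
        if nx * ny > len(self.free):
            return None
        return {(i, j): self.free.pop() for j in range(ny) for i in range(nx)}

    def release(self, index):
        """Freeing an image just returns its tiles to the free list."""
        self.free.extend(index.values())

    def lookup(self, index, px, py):
        """Map an image pixel to (atlas tile, pixel within that tile)."""
        tile = index[(px // self.tile_size, py // self.tile_size)]
        return tile, (px % self.tile_size, py % self.tile_size)


atlas = TileAtlas(tiles_x=4, tiles_y=4, tile_size=32)
image = atlas.allocate(100, 40)        # needs 4 x 2 tiles
too_big = atlas.allocate(200, 200)     # needs 7 x 7 tiles, more than remain
tile, within = atlas.lookup(image, 70, 35)
free_after_allocate = len(atlas.free)
atlas.release(image)
free_after_release = len(atlas.free)
```

Tiles of an image need not be adjacent in the atlas, so fragmentation is not a concern; the index lookup absorbs the indirection.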
Picture of color atlas (taken from image-test)
(Images packed are from Blob Wars' gfx/cutscenes)
Issue: Clipping

1. Canvas rendering means we can clipIn or clipOut against paths.

2. We aim to leverage the GPU to handle the clipping without requiring the CPU to "compute" or
track the clipping region.
Solution: HW Accelerated Clipping

FastUIDraw does NOT track the contents of the clipping region. FastUIDraw relies entirely
on the GPU to implement clipping efficiently.
- The depth buffer is used to implement clipOut:
  1. The path by which to clipOut is set to draw before that which it occludes
  2. The depth value (stored as a single value in the data store buffer) is selected after the items it
     occludes are placed on the attribute, index and data store buffers
- clipIn by a rectangle is implemented by HW clip planes (in GL, gl_ClipDistance)
  - If the transformation since the last clipIn does NOT preserve the x-axis and y-axis, then the complement of
    the previous clipIn rectangle is 4 quads by which we clipOut. These quads are clipped to the new
    clipping rectangle to minimize how many depth pixels are touched.
- clipIn by a filled path is just:
  1. clipIn by the rectangle of the bounding box of the filled path
  2. clipOut by the complement fill rule of the path
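The clipOut steps can be modeled with a toy one-dimensional depth buffer. This is only a sketch under stated assumptions: items receive increasing integer depth values in painter's order, the depth test passes when the incoming value is greater than or equal to the stored one, and the occluder writes depth but no color. None of this is FastUIDraw's actual code.

```python
def clip_out_then_draw(clip_span, items):
    """Queue an occluder first, then the items; the occluder's depth value is
    filled in only after the items it occludes have been queued."""
    commands = [None]                      # slot reserved for the occluder
    z = 0
    for x0, x1, color in items:
        z += 1                             # later items are nearer
        commands.append(("item", x0, x1, z, color))
    # Depth chosen after the fact: in front of every item it must occlude.
    commands[0] = ("clip", clip_span[0], clip_span[1], z + 1, None)
    return commands

def render(width, commands):
    """Toy rasterizer over a 1D row of pixels with a depth test."""
    depth = [0] * width
    color = [None] * width
    for kind, x0, x1, z, c in commands:
        for x in range(x0, x1):
            if z >= depth[x]:
                depth[x] = z
                if kind == "item":
                    color[x] = c           # the occluder writes depth only
    return color

commands = clip_out_then_draw((2, 5), [(0, 8, "red"), (4, 8, "blue")])
row = render(8, commands)
```

Pixels 2 to 4 stay empty because the occluder's depth rejects both items there, and the CPU never computes the clipped region itself.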
Current State

A. Most Canvas features are ready:

1. 3x3 transformation
2. Brush (image, linear gradient, radial gradient, brush transformation)
3. Stroking with or without anti-aliasing
4. Filling with arbitrary fill rule
5. clipIn by path or rectangle
6. clipOut by path
7. GPU glyph rendering
B. What is not yet done

1. The glyph renderer has issues with some glyphs; a "more expensive" glyph renderer
needs to be implemented for these glyphs
2. Anti-aliasing for filling paths
3. Dashed stroking (in progress)
4. Productization (!)
Try it out! Give me feedback!

Clone from https://github.com/01org/fastuidraw

Read the docs (make docs)

Play with the demos

You can even "make install"

Send feedback to me

A. through the GitHub page (issues, etc.)

B. or my email: kevin.rogovin@intel.com